@codeflash-ai codeflash-ai bot commented Nov 19, 2025

📄 21% (0.21x) speedup for _sample in datacompy/fugue.py

⏱️ Runtime : 10.3 milliseconds → 8.57 milliseconds (best of 201 runs)

📝 Explanation and details

The optimization achieves a 20% speedup through three key improvements:

What was optimized:

  1. Cached len(df) computation - Stored in df_len variable to avoid repeated calls
  2. Eliminated unnecessary reset_index() calls - Added a check to return the DataFrame directly when it already has a default RangeIndex (start=0, step=1)
  3. Used ignore_index=True in df.sample() - Replaced the separate reset_index(drop=True) call with pandas' built-in index resetting

Why it's faster:

  • Cache hit on length check: Eliminates redundant DataFrame length calculations
  • Zero-copy optimization: When DataFrames already have the correct index format, we avoid creating a new DataFrame copy entirely (26 out of 30 cases in profiling hit this fast path)
  • Single index operation: Using ignore_index=True in sample() is more efficient than calling sample() followed by reset_index(drop=True)
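
The third point can be illustrated with a small standalone example (assuming pandas >= 1.3, where `DataFrame.sample` gained the `ignore_index` parameter): both approaches produce the same frame, but the two-step version materializes an intermediate index that is immediately discarded.

```python
import pandas as pd

df = pd.DataFrame({"a": range(10)})

# Two-step: sample, then reset the index in a separate pass.
two_step = df.sample(n=4, random_state=0).reset_index(drop=True)

# Single-step: ignore_index=True resets the index inside sample() itself.
one_step = df.sample(n=4, random_state=0, ignore_index=True)

# The results are identical; only the intermediate work differs.
pd.testing.assert_frame_equal(two_step, one_step)
```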

Performance impact based on test results:

The optimization shows dramatic improvements when returning all rows (1000%+ faster in many cases) thanks to the zero-copy path, and modest but consistent 8-13% improvements when actually sampling. This fits the usage pattern: _sample is called in data comparison reporting workflows to produce both unique-row samples and mismatch samples, operations that frequently return the entire DataFrame when sample counts are large relative to data size.

Hot path relevance:

Based on the function references, _sample is called multiple times during report generation (_get_compare_result, _aggregate_stats) and appears to be in performance-critical data comparison workflows where these micro-optimizations compound across multiple sampling operations.

Correctness verification report:

Test                            Status
⚙️ Existing Unit Tests          🔘 None Found
🌀 Generated Regression Tests   39 Passed
⏪ Replay Tests                 22 Passed
🔎 Concolic Coverage Tests      🔘 None Found
📊 Tests Coverage               100.0%
🌀 Generated Regression Tests and Runtime
import pandas as pd

# imports
import pytest
from datacompy.fugue import _sample

# unit tests

# 1. Basic Test Cases


def test_sample_returns_all_when_sample_count_equals_len():
    # DataFrame with 3 rows, sample_count == 3
    df = pd.DataFrame({"a": [1, 2, 3]})
    codeflash_output = _sample(df, 3)
    result = codeflash_output  # 37.9μs -> 3.34μs (1035% faster)


def test_sample_returns_all_when_sample_count_greater_than_len():
    # DataFrame with 2 rows, sample_count == 5
    df = pd.DataFrame({"a": [10, 20]})
    codeflash_output = _sample(df, 5)
    result = codeflash_output  # 35.4μs -> 2.68μs (1222% faster)


def test_sample_returns_sampled_rows_when_sample_count_less_than_len():
    # DataFrame with 5 rows, sample_count == 2
    df = pd.DataFrame({"a": [1, 2, 3, 4, 5]})
    codeflash_output = _sample(df, 2)
    result = codeflash_output  # 304μs -> 278μs (9.41% faster)
    # Check that the sampled values are as expected for random_state=0
    expected = df.sample(n=2, random_state=0).reset_index(drop=True)
    pd.testing.assert_frame_equal(result, expected)


def test_sample_preserves_columns_and_types():
    # DataFrame with different types
    df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"], "c": [1.1, 2.2, 3.3]})
    codeflash_output = _sample(df, 2)
    result = codeflash_output  # 310μs -> 274μs (13.0% faster)


# 2. Edge Test Cases


def test_sample_on_empty_dataframe():
    # Empty DataFrame, any sample_count
    df = pd.DataFrame({"a": []})
    codeflash_output = _sample(df, 5)
    result = codeflash_output  # 35.0μs -> 2.98μs (1073% faster)


def test_sample_count_zero_returns_empty_dataframe():
    # DataFrame with rows, sample_count == 0
    df = pd.DataFrame({"a": [1, 2, 3]})
    codeflash_output = _sample(df, 0)
    result = codeflash_output  # 283μs -> 257μs (10.1% faster)


def test_sample_count_negative_returns_empty_dataframe():
    # DataFrame with rows, sample_count == -1 (should behave like n=-1 in pandas, which errors)
    df = pd.DataFrame({"a": [1, 2, 3]})
    with pytest.raises(ValueError):
        _sample(df, -1)  # 140μs -> 141μs (0.726% slower)


def test_sample_on_dataframe_with_duplicate_rows():
    # DataFrame with duplicate rows, sample_count < len
    df = pd.DataFrame({"a": [1, 2, 2, 3, 3, 3]})
    codeflash_output = _sample(df, 3)
    result = codeflash_output  # 293μs -> 267μs (9.65% faster)


def test_sample_on_dataframe_with_single_row():
    df = pd.DataFrame({"a": [99]})
    codeflash_output = _sample(df, 1)
    result = codeflash_output  # 35.7μs -> 2.80μs (1174% faster)


def test_sample_count_larger_than_maxint():
    # DataFrame with 3 rows, sample_count is very large
    df = pd.DataFrame({"a": [1, 2, 3]})
    codeflash_output = _sample(df, 10**9)
    result = codeflash_output  # 34.9μs -> 2.62μs (1230% faster)


# 3. Large Scale Test Cases


def test_sample_large_dataframe_sample_smaller():
    # DataFrame with 1000 rows, sample_count = 100
    df = pd.DataFrame({"a": list(range(1000))})
    codeflash_output = _sample(df, 100)
    result = codeflash_output  # 319μs -> 289μs (10.2% faster)
    # Deterministic sample
    expected = df.sample(n=100, random_state=0).reset_index(drop=True)
    pd.testing.assert_frame_equal(result, expected)


def test_sample_large_dataframe_sample_equals_len():
    # DataFrame with 1000 rows, sample_count = 1000
    df = pd.DataFrame({"a": list(range(1000))})
    codeflash_output = _sample(df, 1000)
    result = codeflash_output  # 36.5μs -> 3.11μs (1072% faster)


def test_sample_large_dataframe_sample_greater_than_len():
    # DataFrame with 1000 rows, sample_count = 1500
    df = pd.DataFrame({"a": list(range(1000))})
    codeflash_output = _sample(df, 1500)
    result = codeflash_output  # 35.1μs -> 2.87μs (1125% faster)


def test_sample_large_dataframe_sample_zero():
    # DataFrame with 1000 rows, sample_count = 0
    df = pd.DataFrame({"a": list(range(1000))})
    codeflash_output = _sample(df, 0)
    result = codeflash_output  # 306μs -> 278μs (10.0% faster)


# Additional edge: test that index is always reset
@pytest.mark.parametrize(
    "size,sample_count",
    [
        (10, 5),
        (10, 10),
        (10, 15),
        (0, 0),
    ],
)
def test_index_is_reset(size, sample_count):
    df = pd.DataFrame({"a": list(range(size))})
    codeflash_output = _sample(df, sample_count)
    result = codeflash_output  # 406μs -> 281μs (44.3% faster)
    assert list(result.index) == list(range(len(result)))


# Additional edge: test with multiple columns and types
def test_sample_multiple_columns_types():
    df = pd.DataFrame(
        {
            "int": range(20),
            "float": [x + 0.5 for x in range(20)],
            "str": [chr(65 + (x % 26)) for x in range(20)],
            "bool": [x % 2 == 0 for x in range(20)],
        }
    )
    codeflash_output = _sample(df, 7)
    result = codeflash_output  # 353μs -> 313μs (12.6% faster)
    # All values must be in original
    for col in df.columns:
        assert result[col].isin(df[col]).all()


# Additional edge: test with unsorted index
def test_sample_unsorted_index():
    df = pd.DataFrame({"a": [10, 20, 30, 40]}, index=[5, 2, 9, 7])
    codeflash_output = _sample(df, 2)
    result = codeflash_output  # 288μs -> 260μs (10.9% faster)


# Additional edge: test with DataFrame with no columns
def test_sample_no_columns():
    df = pd.DataFrame(index=[0, 1, 2])
    codeflash_output = _sample(df, 2)
    result = codeflash_output  # 243μs -> 223μs (8.98% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
import pandas as pd

# imports
import pytest
from datacompy.fugue import _sample

# unit tests

# --- Basic Test Cases ---


def test_sample_less_than_count_returns_all():
    """If sample_count is greater than the DataFrame length, return all rows, reset index."""
    df = pd.DataFrame({"a": [1, 2, 3]})
    codeflash_output = _sample(df, 5)
    result = codeflash_output  # 37.7μs -> 3.68μs (923% faster)


def test_sample_equal_count_returns_all():
    """If sample_count equals DataFrame length, return all rows, reset index."""
    df = pd.DataFrame({"a": [1, 2, 3]})
    codeflash_output = _sample(df, 3)
    result = codeflash_output  # 35.4μs -> 2.89μs (1127% faster)


def test_sample_greater_than_count_returns_all():
    """If sample_count is much greater than DataFrame length, return all rows, reset index."""
    df = pd.DataFrame({"a": [1, 2, 3]})
    codeflash_output = _sample(df, 100)
    result = codeflash_output  # 34.9μs -> 2.86μs (1120% faster)


def test_sample_smaller_than_count_returns_sample():
    """If sample_count is less than DataFrame length, return a sample of that size, reset index."""
    df = pd.DataFrame({"a": list(range(10))})
    codeflash_output = _sample(df, 4)
    result = codeflash_output  # 305μs -> 283μs (7.83% faster)
    # Should be a subset of the original rows
    for val in result["a"]:
        assert val in df["a"].values
    # Deterministic: always same output due to random_state=0
    expected = df.sample(n=4, random_state=0).reset_index(drop=True)
    pd.testing.assert_frame_equal(result, expected)


def test_sample_returns_new_index():
    """Returned DataFrame should always have a RangeIndex starting from 0."""
    df = pd.DataFrame({"a": [10, 20, 30, 40]})
    codeflash_output = _sample(df, 2)
    result = codeflash_output  # 282μs -> 260μs (8.48% faster)


# --- Edge Test Cases ---


def test_sample_count_zero():
    """If sample_count is zero, return empty DataFrame with same columns and reset index."""
    df = pd.DataFrame({"a": [1, 2, 3]})
    codeflash_output = _sample(df, 0)
    result = codeflash_output  # 274μs -> 250μs (9.80% faster)


def test_sample_empty_dataframe():
    """If DataFrame is empty, always return empty DataFrame with same columns and reset index."""
    df = pd.DataFrame({"a": []})
    codeflash_output = _sample(df, 5)
    result = codeflash_output  # 33.3μs -> 3.23μs (930% faster)


def test_sample_with_negative_count():
    """Negative sample_count should raise a ValueError (pandas behavior)."""
    df = pd.DataFrame({"a": [1, 2, 3]})
    with pytest.raises(ValueError):
        _sample(df, -1)  # 142μs -> 143μs (0.286% slower)


def test_sample_dataframe_with_duplicate_rows():
    """Should handle DataFrames with duplicate rows correctly."""
    df = pd.DataFrame({"a": [1, 2, 2, 3, 3, 3]})
    codeflash_output = _sample(df, 4)
    result = codeflash_output  # 300μs -> 271μs (10.8% faster)
    # All sampled values should be from the original DataFrame
    for val in result["a"]:
        assert val in df["a"].values


def test_sample_dataframe_with_non_default_index():
    """Should ignore original index and always reset index in the result."""
    df = pd.DataFrame({"a": [1, 2, 3]}, index=[10, 20, 30])
    codeflash_output = _sample(df, 2)
    result = codeflash_output  # 289μs -> 264μs (9.60% faster)
    # Values should be from the original DataFrame
    for val in result["a"]:
        assert val in df["a"].values


def test_sample_dataframe_with_multiple_columns():
    """Should preserve all columns and sample rows as expected."""
    df = pd.DataFrame({"a": range(10), "b": list("abcdefghij")})
    codeflash_output = _sample(df, 5)
    result = codeflash_output  # 309μs -> 272μs (13.6% faster)
    # Each sampled row must match an original (a, b) pair
    for _, row in result.iterrows():
        assert (row["a"], row["b"]) in set(zip(df["a"], df["b"]))


def test_sample_dataframe_with_nan_values():
    """Should handle DataFrames containing NaN values."""
    df = pd.DataFrame({"a": [1, float("nan"), 3, 4, float("nan")]})
    codeflash_output = _sample(df, 3)
    result = codeflash_output  # 285μs -> 260μs (9.54% faster)
    # NaN values cannot be membership-tested directly (NaN != NaN)
    for val in result["a"]:
        assert pd.isna(val) or val in df["a"].values


# --- Large Scale Test Cases ---


def test_sample_large_dataframe_sample_small():
    """Sample a small number of rows from a large DataFrame."""
    df = pd.DataFrame({"a": list(range(1000)), "b": [x % 7 for x in range(1000)]})
    codeflash_output = _sample(df, 10)
    result = codeflash_output  # 305μs -> 279μs (9.50% faster)
    # Column b was built as a % 7, so the pairing must survive sampling
    for _, row in result.iterrows():
        assert row["b"] == row["a"] % 7


def test_sample_large_dataframe_sample_all():
    """Sample all rows from a large DataFrame (should return all rows, index reset)."""
    df = pd.DataFrame({"a": list(range(1000)), "b": [x % 7 for x in range(1000)]})
    codeflash_output = _sample(df, 1000)
    result = codeflash_output  # 38.1μs -> 3.53μs (980% faster)


def test_sample_large_dataframe_sample_more_than_all():
    """Sample more than available rows from a large DataFrame (should return all rows, index reset)."""
    df = pd.DataFrame({"a": list(range(1000)), "b": [x % 7 for x in range(1000)]})
    codeflash_output = _sample(df, 2000)
    result = codeflash_output  # 36.4μs -> 3.12μs (1066% faster)


def test_sample_large_dataframe_sample_half():
    """Sample half of the rows from a large DataFrame."""
    df = pd.DataFrame({"a": list(range(1000)), "b": [x % 7 for x in range(1000)]})
    codeflash_output = _sample(df, 500)
    result = codeflash_output  # 321μs -> 290μs (10.5% faster)


def test_sample_large_dataframe_deterministic():
    """Sampling should be deterministic with random_state=0."""
    df = pd.DataFrame({"a": list(range(1000))})
    codeflash_output = _sample(df, 100)
    result1 = codeflash_output  # 307μs -> 282μs (8.90% faster)
    codeflash_output = _sample(df, 100)
    result2 = codeflash_output  # 223μs -> 203μs (9.95% faster)
    pd.testing.assert_frame_equal(result1, result2)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
⏪ Replay Tests and Runtime
Test File::Test Function Original ⏱️ Optimized ⏱️ Speedup
test_pytest_teststest_sparktest_helper_py_teststest_fuguetest_fugue_polars_py_teststest_fuguetest_fugue_p__replay_test_0.py::test_datacompy_fugue__sample 3.27ms 2.60ms 25.9%✅

To edit these changes git checkout codeflash/optimize-_sample-mi6jonhm and push.


@codeflash-ai codeflash-ai bot requested a review from mashraf-222 November 19, 2025 21:59
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Nov 19, 2025